Feat: Various Configurable Initializations #161
Conversation
Nice work! I think the implementation is correct (I checked it against the paper). I only had a few questions where things were not clear to me. Regarding the integration into modalities, I think we need a few more iterations to fix some issues around dependencies and coupling.

I'm not the biggest fan of having the initialisation in the parent class of the models (i.e., NNModel). The reason is that weight initialisation depends heavily on the concrete model implementation. For instance, the gpt2 model must have its c_proj parameters initialised in a particular way, whereas CoCa, for instance, has differently named parameters. Currently, we do string matching in the parent class, which introduces low-level, model-specific dependencies there. The same applies to modules, e.g., a custom linear layer introduced in a particular model would also need custom initialisation in the parent class. For each of these special cases we would have to modify the parent class, strongly coupling the dependencies (see the sketch of this anti-pattern below).
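To illustrate the concern, here is a minimal, hypothetical sketch of the string-matching approach being criticised; the class name, parameter names, and the special-case scaling are assumptions for the example, not the actual modalities code:

```python
import torch.nn as nn


class NNModel(nn.Module):
    """Hypothetical parent class that initialises weights for all models."""

    def init_weights(self, std: float) -> None:
        for name, param in self.named_parameters():
            # Model-specific string matching leaks into the parent class:
            # "c_proj" is a gpt2 detail; other models (e.g. CoCa) would each
            # need their own branch here.
            if name.endswith("c_proj.weight"):
                # made-up special-case scaling, purely for illustration
                nn.init.normal_(param, mean=0.0, std=std * 0.5)
            elif "weight" in name and param.dim() >= 2:
                nn.init.normal_(param, mean=0.0, std=std)
```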
We could resolve this by inverting the dependency: we introduce a generic WeightInitializer class that initialises the weights of a model in a generic, configurable way. The WeightInitializer would be passed to the constructor of the concrete model, and the model would then call something like weight_initializer.init_weights(self, weight_init_config). Basically, this is the strategy pattern, where the strategy modifies the calling object (here, the model) in place.

We should have one generic WeightInitializer covering the general cases. For specific models, we can also introduce WeightInitializers that are specific to a certain model. These WeightInitializers should be instantiable as part of the hierarchical instantiation, as we do for the other components. A sketch of this design follows below.
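A minimal sketch of the proposed strategy pattern; the interfaces and names are assumptions for illustration, not a final modalities API:

```python
from abc import ABC, abstractmethod

import torch.nn as nn


class WeightInitializer(ABC):
    """Strategy that initialises a model's weights in place."""

    @abstractmethod
    def init_weights(self, model: nn.Module) -> None:
        ...


class NormalWeightInitializer(WeightInitializer):
    """Generic case: normal initialisation for all weight matrices."""

    def __init__(self, mean: float = 0.0, std: float = 0.02):
        self.mean = mean
        self.std = std

    def init_weights(self, model: nn.Module) -> None:
        for param in model.parameters():
            if param.dim() >= 2:  # skip biases and norm parameters
                nn.init.normal_(param, mean=self.mean, std=self.std)


class GPT2Model(nn.Module):
    """The concrete model receives the initializer via its constructor."""

    def __init__(self, weight_initializer: WeightInitializer, n_layer: int = 2):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(128, 128) for _ in range(n_layer))
        # The strategy modifies the calling object in place; model-specific
        # rules live in a model-specific WeightInitializer subclass instead
        # of a shared parent class.
        weight_initializer.init_weights(self)
```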
The config for the model and the WeightInitializer would look like this:
```yaml
weight_initializer:
  <weight init config...>

model:
  component_key: model
  variant_key: gpt2
  config:
    n_layer: 2
    n_head_q: 8
    n_head_kv: 4
    ffn_hidden: 128
    weight_initializer:
      instance_key: weight_initializer
      pass_type: BY_REFERENCE
```
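If I read the hierarchical instantiation correctly, pass_type: BY_REFERENCE would inject the already-instantiated weight_initializer component into the model's constructor rather than re-building it from its config, so a single initializer instance can be shared and swapped purely via the config.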
Additionally, I left a bunch of mostly minor comments regarding some questions and ideas.
Draft: Feat/initialization component
LGTM :)
What does this PR do?
This PR implements the weight initializations described in https://arxiv.org/abs/2312.16903:
A weight initialisation component is introduced that modifies the model weights in place (see #168 for more details).
General Changes
std can now be set to the string "auto", in which case it equals sqrt(2 / (5 * hidden_dim)); see e.g. https://arxiv.org/abs/2312.16903 and the sketch below.
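A minimal sketch of how the "auto" value could be resolved; the function name is an assumption for illustration, while the formula is the one stated above:

```python
import math
from typing import Union


def resolve_std(std: Union[float, str], hidden_dim: int) -> float:
    """Resolve the configured std; "auto" maps to sqrt(2 / (5 * hidden_dim)),
    the scaled initialisation from https://arxiv.org/abs/2312.16903."""
    if std == "auto":
        return math.sqrt(2 / (5 * hidden_dim))
    return float(std)


# e.g. hidden_dim=768 yields std ≈ 0.0228
print(resolve_std("auto", hidden_dim=768))
```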
Breaking Changes
Checklist before submitting final PR
- Tests pass (python tests/tests.py)